The LinGO Redwoods Treebank: Motivation and Preliminary Applications
Abstract
The LinGO Redwoods initiative is a seed activity in the design and development of a new type of treebank. While several medium- to large-scale treebanks exist for English (and for other major languages), pre-existing publicly available resources exhibit the following limitations: (i) annotation is mono-stratal, encoding either topological (phrase structure) or tectogrammatical (dependency) information, (ii) the depth of linguistic information recorded is comparatively shallow, (iii) the design and format of linguistic representation in the treebank hard-wires a small, predefined range of ways in which information can be extracted from the treebank, and (iv) representations in existing treebanks are static and, over the (often year- or decade-long) evolution of a large-scale treebank, tend to fall behind the development of the field. LinGO Redwoods aims at the development of a novel treebanking methodology, rich in nature and dynamic both in the ways linguistic data can be retrieved from the treebank in varying granularity and in the constant evolution and regular updating of the treebank itself. Since October 2001, the project has been working to build the foundations for this new type of treebank, to develop a basic set of tools for treebank construction and maintenance, and to construct an initial set of 10,000 annotated trees to be distributed together with the tools under an open-source license.

1 Why Another (Type of) Treebank?

For the past decade or more, symbolic, linguistically oriented methods and statistical or machine learning approaches to NLP have often been perceived as incompatible or even competing paradigms. While shallow and probabilistic processing techniques have produced useful results in many classes of applications, they have not met the full range of needs for NLP, particularly where precise interpretation is important, or where the variety of linguistic expression is large relative to the amount of training data available.
On the other hand, deep approaches to NLP have only recently achieved broad enough grammatical coverage and sufficient processing efficiency to allow the use of precise linguistic grammars in certain types of real-world applications. In particular, applications of broad-coverage analytical grammars for parsing or generation require the use of sophisticated statistical techniques for resolving ambiguities; the transfer of Head-Driven Phrase Structure Grammar (HPSG) systems into industry, for example, has amplified the need for general parse ranking, disambiguation, and robust recovery techniques. We observe a general consensus on the necessity for bridging activities, combining symbolic and stochastic approaches to NLP. But although we find promising research in stochastic parsing in a number of frameworks, there is a lack of appropriately rich and dynamic language corpora for HPSG. Likewise, stochastic parsing has so far been focussed on information-extraction-type applications and lacks any depth of semantic interpretation. The Redwoods initiative is designed to fill this gap.

In the next section, we present some of the motivation for the LinGO Redwoods project as a treebank development process. Although construction of the treebank is in its early stages, we present in Section 3 some preliminary results of using the treebank data already acquired in concrete applications. We show, for instance, that even simple statistical models of parse ranking trained on the Redwoods corpus built so far can disambiguate parses with close to 80% accuracy.

2 A Rich and Dynamic Treebank

The Redwoods treebank is based on open-source HPSG resources developed by a broad consortium of research groups including researchers at Stanford (USA), Saarbrücken (Germany), Cambridge, Edinburgh, and Sussex (UK), and Tokyo (Japan). Their wide distribution and common acceptance make the HPSG framework and resources an excellent anchor point for the Redwoods treebanking initiative.
The key innovative aspect of the Redwoods approach to treebanking is the anchoring of all linguistic data captured in the treebank to the HPSG framework and a generally available broad-coverage grammar of English, the LinGO English Resource Grammar (Flickinger, 2000) as implemented with the LKB grammar development environment (Copestake, 2002). Unlike existing treebanks, there is no need to define a (new) form of grammatical representation specific to the treebank. Instead, the treebank records complete syntacto-semantic analyses as defined by the LinGO ERG and provides tools to extract different types of linguistic information at varying granularity.

The treebanking environment, building on the [incr tsdb()] profiling environment (Oepen & Callmeier, 2000), presents annotators, one sentence at a time, with the full set of analyses produced by the grammar. Using a pre-existing tree comparison tool in the LKB (similar in kind to the SRI Cambridge TreeBanker; Carter, 1997), annotators can quickly navigate through the parse forest and identify the correct or preferred analysis in the current context (or, in rare cases, reject all analyses proposed by the grammar). The tree selection tool presents users, who need little expert knowledge of the underlying grammar, with a range of basic properties that distinguish competing analyses and that are relatively easy to judge. All disambiguating decisions made by annotators are recorded in the [incr tsdb()] database and thus become available for (i) later dynamic extraction from the annotated profile or (ii) dynamic propagation into a more recent profile obtained from re-running a newer version of the grammar on the same corpus.
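The selection procedure described above can be illustrated with a minimal sketch. It assumes that each analysis is reduced to a set of basic properties and that a distinguishing property is one that holds for some but not all remaining analyses. The class name, function names, and property strings are hypothetical and are not taken from the LKB or [incr tsdb()].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate parse, reduced to the basic properties it exhibits."""
    ident: int
    properties: frozenset


def discriminants(analyses):
    """Return the properties that hold for some, but not all, analyses."""
    if not analyses:
        return set()
    all_props = set().union(*(a.properties for a in analyses))
    shared = set.intersection(*(set(a.properties) for a in analyses))
    return all_props - shared


def apply_decision(analyses, prop, accept):
    """Keep analyses that have `prop` (accept=True) or lack it (accept=False)."""
    return [a for a in analyses if (prop in a.properties) == accept]


# Toy parse forest for a PP-attachment ambiguity (invented property names).
forest = [
    Analysis(1, frozenset({"S -> NP VP", "PP attaches to VP"})),
    Analysis(2, frozenset({"S -> NP VP", "PP attaches to NP"})),
    Analysis(3, frozenset({"S -> NP VP", "PP attaches to NP", "noun compound"})),
]

# The annotator's judgements are recorded as (property, accept) pairs.
decisions = [("PP attaches to NP", True), ("noun compound", False)]
remaining = forest
for prop, accept in decisions:
    remaining = apply_decision(remaining, prop, accept)
```

Recording the decisions rather than only the winning tree is what would make them reusable later; the surviving analysis is simply whatever the decisions leave in `remaining`.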
Important innovative research aspects in this approach to treebanking are (i) enabling users of the treebank to extract information of the type they need and to transform the available representation into a form suited to their needs and (ii) the ability to update the treebank with an enhanced version of the grammar in an automated fashion, viz. by re-applying the disambiguating decisions on the corpus with an updated version of the grammar.

Depth of Representation and Transformation of Information

Internally, the [incr tsdb()] database records analyses in three different formats, viz. (i) as a derivation tree composed of identifiers of lexical items and constructions used to build the analysis, (ii) as a traditional phrase structure tree labeled with an inventory of some fifty atomic labels (of the type ‘S’, ‘NP’, ‘VP’, etc.), and (iii) as an underspecified MRS (Copestake, Lascarides, & Flickinger, 2001) meaning representation. While representation (ii) will in many cases be similar to the representation found in the Penn Treebank, representation (iii) subsumes the functor-argument (or tectogrammatical) structure advocated in the Prague Dependency Treebank or the German TiGer corpus. Most importantly, however, representation (i) provides all the information required to replay the full HPSG analysis (using the original grammar and one of the open-source HPSG processing environments, e.g., the LKB or PET, which have already been interfaced to [incr tsdb()]). Using the latter approach, users of the treebank can extract information in whatever representation they require, simply by reconstructing full analyses and adapting the existing mappings (e.g., the inventory of node labels used for phrase structure trees) to their needs.
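As a rough illustration of adapting a label inventory, the sketch below renders a toy derivation tree as a bracketed phrase structure string under a user-supplied mapping. It makes a simplifying assumption: labels are looked up directly from node identifiers, whereas a real system would derive them from the reconstructed full analysis. All identifiers and the `Node` and `relabel` names are hypothetical and do not come from the ERG or the LKB.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Derivation tree node: a rule or lexical entry identifier plus daughters."""
    ident: str
    daughters: list = field(default_factory=list)
    surface: str = ""  # surface form, set on leaves only


def relabel(node, labels, default="X"):
    """Render a bracketed phrase structure string using a label inventory."""
    label = labels.get(node.ident, default)
    if not node.daughters:
        return f"({label} {node.surface})"
    inner = " ".join(relabel(d, labels, default) for d in node.daughters)
    return f"({label} {inner})"


# Hypothetical identifiers, chosen for readability only.
tree = Node("subj_head", [
    Node("pron_we", surface="we"),
    Node("head_comp", [
        Node("verb_meet", surface="meet"),
        Node("noun_monday", surface="Monday"),
    ]),
])

# One possible coarse label inventory; users could swap in their own.
coarse = {"subj_head": "S", "head_comp": "VP", "pron_we": "NP",
          "verb_meet": "V", "noun_monday": "NP"}

print(relabel(tree, coarse))  # (S (NP we) (VP (V meet) (NP Monday)))
```

Swapping `coarse` for a finer or coarser dictionary changes the output labels without touching the stored tree, which is the point of keeping the derivation as the primary record.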
Likewise, the existing [incr tsdb()] facilities for comparing across competence and performance profiles can be deployed to evaluate results of a (stochastic) parse disambiguation system, essentially using the preferences recorded in the treebank as a ‘gold standard’ target for comparison.

Automating Treebank Construction

Although a precise HPSG grammar like the LinGO ERG will typically assign a small number of analyses to a given sentence, choosing among a few or sometimes a few dozen readings is time-consuming and error-prone. The project is exploring two approaches to automating the disambiguation task: (i) seeding lexical selection from a part-of-speech (POS) tagger and (ii) automated inter-annotator comparison and assisted resolution of conflicts.

Treebank Maintenance and Evolution

One of the challenging research aspects of the Redwoods initiative is developing a methodology for automated updates of the treebank to reflect the continuous evolution of the underlying linguistic framework and of the LinGO grammar. Again building on the notion of elementary linguistic discriminators, we expect to explore the semi-automatic propagation of recorded disambiguating decisions into newer versions of the parsed corpus. While it can be assumed that the basic phrase structure inventory and granularity of lexical distinctions have stabilized to a certain degree, it is not guaranteed that one set of discriminators will always fully disambiguate a more recent set of analyses for the same utterance (as the grammar may introduce new ambiguity), nor that re-playing a history of disambiguating decisions will necessarily identify the correct, preferred analysis for all sentences.
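The caveat about propagating decisions can be made concrete with a small sketch. It assumes decisions are stored as (property, accept) pairs and that each analysis is a set of properties. The function name, status names, and data are hypothetical, and this is not the project's actual update procedure.

```python
def replay(decisions, new_analyses):
    """Re-apply recorded (property, accept) decisions to a fresh set of analyses.

    `new_analyses` maps an analysis id to the set of properties it exhibits.
    Returns (status, surviving ids); status is 'resolved', 'ambiguous' or 'rejected'.
    """
    surviving = dict(new_analyses)
    for prop, accept in decisions:
        surviving = {i: props for i, props in surviving.items()
                     if (prop in props) == accept}
    if not surviving:
        status = "rejected"
    elif len(surviving) == 1:
        status = "resolved"
    else:
        status = "ambiguous"
    return status, sorted(surviving)


# Decisions recorded against an older version of the grammar (invented data).
decisions = [("PP attaches to NP", True), ("noun compound", False)]

old_profile = {
    1: {"S -> NP VP", "PP attaches to VP"},
    2: {"S -> NP VP", "PP attaches to NP"},
    3: {"S -> NP VP", "PP attaches to NP", "noun compound"},
}

# A newer grammar version adds a reading that the old decisions cannot rule out.
new_profile = dict(old_profile)
new_profile[4] = {"S -> NP VP", "PP attaches to NP", "new reading"}

print(replay(decisions, old_profile))  # ('resolved', [2])
print(replay(decisions, new_profile))  # ('ambiguous', [2, 4])
```

In this toy case the same decisions that fully disambiguate the old analyses leave two candidates under the newer grammar, so the item would need manual inspection.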
A better understanding of the nature of discriminators and relations holding among them is expected to provide the foundations for an update procedure that, ultimately, should be mostly automated, with minimal manual inspection, and which can become part of the regular regression test cycle for the grammar.

Scope and Current State of Seeding Initiative

The first 10,000 trees to be hand-annotated as part of the kick-off initiative are taken from a domain for which the English Resource Grammar is known to exhibit broad and accurate coverage, viz. transcribed face-to-face dialogues in an appointment scheduling and travel arrangement domain.[1] For the follow-up phase, the project is expected to move into a second domain and text genre, presumably more formal, edited text taken from newspaper text or another widely available on-line source.

As of June 2002, the seeding initiative is well underway. The integrated treebanking environment, combining [incr tsdb()] and the LKB tree selection tool, has been established and has been deployed in a first iteration of annotating the VerbMobil utterances. The approach to parse selection through minimal discriminators turned out not to be hard to learn for a second-year Stanford undergraduate in linguistics, and allowed completion of the first iteration in less than ten weeks. Table 1 summarizes the current Redwoods status.

[1] Corpora of some 50,000 such utterances are readily available from the VerbMobil project (Wahlster, 2000) and have already been studied extensively among researchers world-wide.

[2] Of the four data sets only VM32 has been double-checked by an expert grammarian and (almost) completely disambiguated to date; therefore it exhibits an interestingly higher degree of phrasal ambiguity in the ‘active = 1’ subset.
Table 1

corpus  subset          ♯      ‖              ×
VM6     total        2422    7·7    4·2    32·9
VM6     active = 0    218    8·0    4·4     9·7
VM6     active = 1   1910    7·0    4·0     7·5
VM6     active > 1     80   10·0    4·8    23·8
VM6     unannotated   214   14·9    4·3   287·5
VM13    total        1984    8·5    4·0    37·9
VM13    active = 0    175    8·5    4·1     9·9
VM13    active = 1   1491    7·2    3·9     7·5
VM13    active > 1     85    9·9    4·5    22·1
VM13    unannotated   233   14·1    4·2    22·1
VM31    total        1726    6·2    4·5    22·4
VM31    active = 0    164    7·9    4·6     8·
Similar resources
LinGO Redwoods A Rich and Dynamic Treebank for HPSG
The LinGO Redwoods initiative is a seed activity in the design and development of a new type of treebank. A treebank is a (typically hand-built) collection of natural language utterances and associated linguistic analyses; typical treebanks—as for example the widely recognized Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993), the Prague Dependency Treebank (Hajic, 1998), or the German T...
Towards Holistic Grammar Engineering and Testing
We present a new methodology for the semiautomated maintenance of a treebank built from analyses of a computational grammar and gauge the effort required for each update cycle. Based on a decade of large-scale grammar engineering experience, we propose a tight integration of treebank maintenance with the continuous evolution of a ‘deep’ computational grammar. 1 Background & Motivation Moving (o...
Generating Semantic Graphs through Self-Organization
In this study, a technique called semantic self-organization is used to scale up the subsymbolic approach by allowing a network to optimally allocate frame representations from a semantic dependency graph. The resulting architecture, INSOMNet, was trained on semantic representations of the newly-released LinGO Redwoods HPSG Treebank of annotated sentences from the VerbMobil project. The results...
Who Did What to Whom? A Contrastive Study of Syntacto-Semantic Dependencies
We investigate aspects of interoperability between a broad range of common annotation schemes for syntacto-semantic dependencies. With the practical goal of making the LinGO Redwoods Treebank accessible to broader usage, we contrast seven distinct annotation schemes of functor–argument structure, both in terms of syntactic and semantic relations. Drawing examples from a multi-annotated gold sta...
Discriminant-Based MRS Banking
We present an approach to discriminant-based MRS banking, i.e. the construction of an annotated corpus where each input item is paired with a logical-form semantics. Semantic annotations are produced by parsing with a broad-coverage precision grammar, followed by manual disambiguation. The selection of the preferred analysis for each item (and hence its semantic form) builds on a notion of sema...